REMARKS

In the patent application, claims 1-24 are pending. In the office action, all pending claims 
are rejected. 

Applicant has amended claim 2 to change "the decoder" to "a decoder" as suggested by 
the Examiner. Applicant has also amended claim 16 to correct for a typographical error. 
No new matter has been introduced. 

At section 2 of the office action, claim 2 is objected to because of informalities. 
Applicant has amended claim 2 to overcome the objection. 

At section 4, claims 1-5, 7-12, 15, 17 and 20 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Lee et al. ("A very low bit rate speech coder based on a recognition/synthesis 
paradigm," IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 5, July 2001, hereafter 
referred to as Lee), in view of Gao (U.S. Patent No. 6,449,590). 

In rejecting claims 1, 11, 17 and 20, the Examiner states that Lee discloses a method and 
system for improving coding efficiency having the following steps: 

creating a plurality of simplified pitch contour segment candidates, each candidate 
corresponding to a sub-segment of the audio signal (Section V.A., pages 486-487); 

measuring deviation between each of the simplified pitch contour segment candidates and 
the pitch values in the corresponding sub-segment; and 

selecting a plurality of consecutive segment candidates to represent the audio segment 
(Section V.A., Pages 486-487; Figure 5); and 

coding the pitch contour data in the sub-segments of the audio signal corresponding to the 
selected segment candidates (Section V. Page 486). 

The Examiner admits that Lee fails to specifically suggest that the start and end points of 
a pitch contour sub-segment candidate may vary from those of the original speech sub-segment. 
The Examiner points to Gao for disclosing a means for time-warping the start and end points of a 
speech sub-segment (col. 2, line 17 to col. 43, line 14). 

The Examiner states that it would be obvious for one skilled in the art to modify the 
approximation method used by Lee using the time-warping method in Gao in order to implement 
an efficient pitch contour coding process. 

Applicant respectfully disagrees. 

The speech coding process according to Gao 

It is respectfully submitted that Gao discloses a method of speech processing wherein the 
encoder performs high-pass filtering and applies a perceptual weighting filter to provide a 
weighted speech signal, and a pitch preprocessing operation is applied to warp the weighted 
speech signal in order to match the interpolated pitch values that will be generated by the 
decoder (col. 5, lines 52 to 65). Gao uses high-pass filtering, perceptual weighting and speech 
signal warping to support lower bit-rate encoding modes. All three of these steps are necessary to 
produce a linear pitch lag contour (see Figure 8c) from a non-linear pitch lag contour (Figure 
8b). In fact, Gao discloses a method where the encoder generates a pitch lag contour by using 
estimates of a previous pitch lag and a current pitch lag of the speech signal and then warps the 
speech signal by temporally deforming the weighted speech signal in order to conform to the 
generated pitch lag contour (col. 70, lines 47-53; col. 71, lines 10-17). 

The speech coding process according to Lee 

Lee discloses a method to substitute non-linear contour segments with linear contour 
segments. Lee simply picks a start point in the pitch contour and searches for an end point in the 
pitch contour that produces a linear segment having an error from the original contour segment 
smaller than d_max. 

The differences between the approaches in Lee and Gao 

Lee's coding method is contour-wise rather than frame-wise (last paragraph of left 
column on page 486). While Gao also discloses a method to substitute non-linear pitch lag 
contour segments with linear pitch lag contour segments, Gao's coding method is frame-wise or 
subframe-wise (col. 42, lines 17-34). Gao uses high-pass filtering of the speech signal and a 
perceptual weighting filter to provide a weighted speech signal. Lee does not use those steps. 

There would be no motivation or incentive to combine the approaches in Lee and Gao 

In order to raise a 103 rejection, the Examiner must show why a person skilled in the art 
would want to apply the method as disclosed in Gao to the method as disclosed in Lee. The 
Examiner fails to show such motivation. 

There are many reasons why there would be no motivation to combine the approaches in 
Lee and Gao. 

A. Complexity 

As admitted by the Examiner, Lee uses the following four steps for pitch contour coding: 

1) creating a plurality of simplified pitch contour segment candidates, each candidate 
corresponding to a sub-segment of the audio signal (Section V.A., pages 486-487); 

2) measuring deviation between each of the simplified pitch contour segment candidates 
and the pitch values in the corresponding sub-segment; and 

3) selecting a plurality of consecutive segment candidates to represent the audio segment 
(Section V.A., Pages 486-487; Figure 5); and 

4) coding the pitch contour data in the sub-segments of the audio signal corresponding to 
the selected segment candidates (Section V. Page 486). 

Gao uses a different approach in pitch contour coding. Gao's process involves at least 
three steps (col. 5, lines 52-64): 

a) high-pass filtering the speech signal; 

b) applying a perceptual weighting filter to the high-pass filtered speech signal for 
providing weighted speech signal, and 

c) warping the weighted speech signal in order to match the interpolated pitch values that 
will be generated by the decoder. 

None of the steps in Gao are used in Lee. Thus, in order to combine the method as 
disclosed in Gao with the method as disclosed in Lee, one must use all seven of the steps shown 
above. The combined method requires a very complex encoder. 

B. Compatibility 

As mentioned earlier, Lee's coding method is contour-wise rather than frame-wise, 
whereas Gao's coding method is frame-wise or subframe-wise. Furthermore, in order to 
time-warp the pitch lag contour, Gao requires applying a perceptual weighting filter to provide a 
weighted speech signal. It is uncertain whether Lee can use the weighted speech signal to create a 
plurality of simplified pitch contour segment candidates, each candidate corresponding to a 
sub-segment of the weighted speech signal, and then measure the deviation between each of the 
simplified pitch contour segment candidates and the pitch values in the corresponding 
sub-segment. 

C. Lee alone can accomplish what the combination of Gao and Lee may provide 

Gao uses a time-warping method to replace non-linear pitch lag contour segments with 
linear pitch lag contour segments. The objective is to lower the coding bit-rate so as to meet a 
certain encoding mode. 

Lee alone can lower the coding bit rate to meet a certain encoding mode by changing 
d_max. For example, if a high bit-rate is available, Lee may use a smaller d_max to improve the 
encoding accuracy. But when a lower bit-rate is required, Lee can use a larger d_max in order to 
reduce the number of linear pitch contour segments. There is no need to introduce three 
additional steps as required in Gao. 

D. Gao's approach is not beneficial to the present invention 

The present invention can lower the coding bit-rate to meet a certain encoding mode by 
changing the predetermined error value in the comparison step 508 as shown in the flowchart 
500 of Figure 4 (p. 11, lines 17-24). There is no need to use the time warping techniques as 
disclosed in Gao. 

Lee, in view of Gao, fails to render the present invention obvious 

In sum, Lee does not require the approach as used in Gao in order to meet a certain 
bit-rate requirement. The present invention does not require the approach as used in Gao in order to 
meet a certain bit-rate requirement. Lee and Gao may not be compatible with each other. Even if 
they are compatible, the combination of Lee and Gao yields an unnecessarily complex encoding 
system. The Examiner fails to show why a person skilled in the art would choose such a 
complex encoding system when a much simpler encoding system can achieve the same result. 

10/692,291 
944-003.191 

For the above reasons, it is respectfully submitted that Lee, in view of Gao, does not 
render the invention as claimed in claims 1, 11, 17 and 20 obvious. 

As for claims 2-5, 7-10, 12 and 15, they are dependent from claims 1 and 11 and recite 
features not recited in claims 1 and 11. For reasons regarding claims 1 and 11 above, it is 
respectfully submitted that claims 2-5, 7-10, 12 and 15 are also distinguishable over the cited Lee 
and Gao references. 

At section 5, claim 6 is rejected under 35 U.S.C. 103(a) as being unpatentable over Lee, 
in view of Gao and further in view of Swaminathan et al. (U.S. Patent No. 5,704,000, hereafter 
referred to as Swaminathan). 

The Examiner cites Swaminathan for disclosing a means for selecting from a plurality of 
pitch candidates corresponding to pitch parameters of a specific pitch period. 

It is respectfully submitted that claim 6 is dependent from claim 1 and recites features not 
recited in claim 1. For reasons regarding claim 1 above, claim 6 is also distinguishable over the 
cited Lee, Gao and Swaminathan references. 

At section 6, claims 13-14, 16, 18, 19 and 21-24 are rejected under 35 U.S.C. 103(a) as 
being unpatentable over Lee, in view of Gao and further in view of Lumelsky (U.S. Patent No. 
6,246,672). 

The Examiner cites Lumelsky for disclosing a storage means for storing encoded audio 
data. 

It is respectfully submitted that claims 13-14, 16, 18, 19 and 21-23 are dependent from 
claims 11, 17 and 20 and recite features not recited in claims 1, 11 and 20. For reasons 
regarding claims 1, 11 and 20 above, claims 13-14, 16, 18, 19 and 21-23 are also distinguishable 
over the cited Lee, Gao and Lumelsky references. 

As for claim 24, it claims a communication network comprising a decoder as claimed in 
claim 17. For reasons regarding claim 17 above, it is respectfully submitted that claim 24 is also 
distinguishable over the cited Lee, Gao and Lumelsky references. 



As amended, claims 1-24 are allowable. Early allowance of all pending claims is 
earnestly solicited. 

A Very Low Bit Rate Speech Coder Based on a 
Recognition/Synthesis Paradigm 

Ki-Seung Lee, Member, IEEE, and Richard V. Cox, Fellow, IEEE 



Abstract—Recent studies have shown that a concatenative 
speech synthesis system with a large database produces more 
natural sounding speech. We apply this paradigm to the design 
of improved very low bit rate speech coders (sub 1000 b/s). 
The proposed speech coder consists of unit selection, prosody 
coding, prosody modification and waveform concatenation. The 
encoder selects the best unit sequence from a large database and 
compresses the prosody information. The transmitted parameters 
include unit indices and the prosody information. To increase 
naturalness as well as intelligibility, two costs are considered in the 
unit selection process: an acoustic target cost and a concatenation 
cost. A rate-distortion-based piecewise linear approximation is 
proposed to compress the pitch contour. The decoder concatenates 
the set of units, and then synthesizes the resultant sequence of 
speech frames using the Harmonic + Noise Model (HNM) scheme. 
Before concatenating units, prosody modification, which includes 
pitch shifting and gain modification, is applied to match those 
of the input speech. With single speaker stimuli, a comparison 
category rating (CCR) test shows that the performance of the 
proposed coder is close to that of the 2400-b/s MELP coder at an 
average bit rate of about 800 b/s during talk spurts. 

Index Terms—Concatenative speech synthesis, piecewise linear 
approximation, rate distortion theory, very low bit rate speech 
coding. 



I. Introduction 

CONTEMPORARY speech coders such as CELP, MELP, 
MBE, or WI provide good quality speech at bit rates as 
low as 2400 b/s. However, for very low bit rates on the order of 
100 b/s, these coders are unable to produce high quality speech, 
due to the reduced number of bits available for accurate mod- 
eling of the signal. In an effort to overcome this limitation, a new 
speech coder is proposed. This coder employs a different para- 
digm than conventional speech coders and is meant for appli- 
cations where there are no delay or complexity limitations. For 
example, such a coder is very useful when requiring storage of 
large amount of pre-recorded speech. A talking book [4], which 
is a spoken equivalent of its printed version, requires huge space 
for storing speech waveforms unless a high compression coding 
scheme is applied. Similarly, for a wide variety of multimedia 
applications, such as language learning assistance, electronic 
dictionaries and encyclopedias, there are potential applications 
of very low bit rate speech coders. 

Techniques for a very low bit rate speech coder are based on 
what has been learned from previous work in speech coding, 
text-to-speech (TTS) synthesis, and speech recognition. Several 
groups of researchers have worked on a TTS-based approach. 
In TTS, synthesized speech can be produced by concatenating 
the waveforms of units selected from a large database. Prosody 
modification is often included as a post-processor for TTS sys- 
tems. This typically adjusts the time scale and/or pitch to modify 
the prosody. Thus, a TTS-based coding scheme can be thought 
of as a speech coder that has a very large codebook composed of 
raw speech signals with additional parameters for compensating 
prosodic difference between the synthesized and the original 
speech signal. The first study using this approach was performed 
by Gerard et al. [3]. In this work, a text message and spoken utterance 
are jointly used to provide a TTS input stream and a 
small number of prototype pitch patterns and duration patterns 
are used for prosody coding. Bradley [4] introduced a wide-band 
speech coder which uses TTS to generate synthetic speech from 
text and then uses speech conversion to convert voice character- 
istics including speaking style, and emotion. This coder operates 
at 300 b/s. However, both these two coders necessarily require 
text transcription. 

A speech coding system based on automatic speech recog- 
nition and TTS synthesis, which employed hidden Markov 
model (HMM)-based phoneme recognition and pitch synchronous 
overlap addition (PSOLA)-based TTS, was proposed 
by Chen et al. [6]. This coder is referred to as the "phonetic 
vocoder" where the individual segments are quantized using a 
phonetic inventory. The reported bit rate was 750 b/s and the 
reconstructed speech quality was above a mean opinion score 
(MOS) of 3.0. For all TTS-based coders, since a speech signal 
is produced by TTS, the quality is highly dependent on the 
performance of the underlying TTS. 

Alternatively, a segmental vocoder is also proposed to achieve 
very low bit rate. This coder attempts to decompose a speech 
signal into a sequence of segments that are subsequently quan- 
tized using a codebook of pre-stored segments. A typical ex- 
ample of this type of coder is the waveform segment vocoder by 
Roucos et al [20]. In this, segmentation was performed in a very 
simple way, detecting regions with large spectral time-deriva- 
tion, then a template sequence is constructed by minimizing dis- 
tortion between a time-normalized template and an input seg- 
ment. Since each template is a waveform segment that contains 
an excitation component as well as a spectral envelope, this 
coder does not need to transmit excitation signals which gen- 
erally require a lot of bits in conventional speech coders. The 
bit rate of this coder is about 300 b/s. They obtained significantly 
less buzzy quality than their previous segmental coder, 
but, there were still artifacts in the coded speech signal, such as 
a "choppy quality," which mainly comes from the simple seg- 
mentation method. One of the limitations of this coder is the 
size of the template table, which is 9000. This number is not 
sufficient for representing the variability of the segments even 
though prosody modification is exploited to compensate for the 
difference between a template and an input segment. 

A segmental vocoder using HMM-segmentation was pro- 
posed recently [1] in which template tables are constructed by 
a series of procedures, temporal decomposition, vector quan- 
tization and multigram segmentation. Each template segment 
is represented by a HMM. This approach is similar to the 
HMM-based phoneme recognition, but nonsupervised training 
was applied, thus the resultant segments do not correspond to 
phonetic inventories. This work mainly focused on the encoder 
part. Manipulation of prosody information was not discussed. 

Although all of these methods are successfully applied to give 
extremely low bit rates, the common problem is that the quality 
of these coders is not satisfactory compared to conventional low 
rate coders (>1000 b/s) even when coding strategy focuses on 
a single speaker's voice. The quality of these coders is often 
not consistent and intelligibility is very bad at times. There are 
several reasons for this, including the relatively few templates 
in a typical system, the distortion introduced by using a speech 
representation that does not code speech transparently, audible 
discontinuities introduced by concatenation at segment bound- 
aries, and the artifacts introduced by time scale modification. 

The main goal of our work is to develop a speech coder whose 
quality is comparable to a conventional low rate speech coder 
(for example, a standard coding scheme at 2400 b/s), while 
maintaining bit rates lower than 1000 b/s. The basic idea is moti- 
vated by waveform-concatenation TTS systems, where a speech 
signal is produced by concatenating a selected unit sequence 
[15]. We utilize a large TTS labeled database as the "codebook" 
for our speech coder. The codebook contains several hours of 
speech, typically filled with phonetically balanced sentences. 
The identities of the phonemes, their durations, their pitch con- 
tours and all speech coding parameters are included in the data- 
base. Our approach to unit selection is different in that we use 
a frame as the basic synthesis unit and introduce a concatena- 
tion cost in order to reduce the distortions between neighboring 
units. This frame-based approach has the advantage that we can 
accurately choose units with short unit length. In addition, since 
frame selection does not require a time-warping process, we 
can synthesize the speech signal without time scale modifica- 
tion. The bit rate of a frame-based approach will be greater, be- 
cause longer segments contribute to the high compression ratio 
of segmental coders. To cope with this problem, we design a se- 
lection process that increases the number of consecutive frames 
and subsequently apply run-length coding. 
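
As a rough illustration of why long runs of consecutive frames help, the sketch below run-length codes a sequence of unit indices. The `(start, length)` pair format and the function names are assumptions made for this example; the text above does not specify the actual bitstream layout.

```python
from typing import List, Tuple


def run_length_encode(unit_indices: List[int]) -> List[Tuple[int, int]]:
    """Collapse runs of consecutive database positions into (start, length) pairs."""
    runs: List[Tuple[int, int]] = []
    for idx in unit_indices:
        # Extend the current run when idx is the next database position.
        if runs and idx == runs[-1][0] + runs[-1][1]:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            runs.append((idx, 1))
    return runs


def run_length_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (start, length) pairs back into the original index sequence."""
    return [start + k for start, length in runs for k in range(length)]


# Hypothetical selected unit indices: two runs and one isolated frame.
selected = [120, 121, 122, 123, 900, 901, 45]
runs = run_length_encode(selected)
restored = run_length_decode(runs)
```

Seven frame indices collapse into three pairs here, which is the effect the selection process tries to encourage.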

The remaining issue associated with a very low rate coder is 
the accurate coding of the pitch (F0) contour. This plays an important 
role in a very low rate coder since the correct pitch contour 
will increase naturalness and an efficient coding scheme 
will provide high coding gain. Nevertheless, most of the previous 
very low rate coders neglect this important issue. A possible 
way to reduce the number of bits in the pitch contour 
coding is to use schemes relying on a parametric description of 
the pitch contour [7]-[13]. In a parametric model, a segmental 
pitch contour is represented by a function and appropriate vari- 
ables. The resulting information for representing the pitch con- 
tour is very small. Studies in this direction have been performed 
in applications requiring simpler representation of the pitch contour, 
such as intonation pattern analysis [8], [9], [11], [12] and 
automatic generation of the pitch contour in TTS systems [7], 
[13]. However, fundamental issues for application to a coding 
paradigm, such as the number of bits for representing model 
parameters and quantitative analysis of model error according 
to bit allocation have not been discussed. 

The principle of our coding scheme is piecewise linear ap- 
proximation of pitch that replaces the original pitch contour 
by consecutive lines. Techniques that minimize overall bit rate 
while maintaining approximation error below a given threshold 
will be described in more detail in Section V. 
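
To make the idea of replacing a pitch contour by consecutive lines concrete, here is a minimal greedy sketch. It is not the rate-distortion-based method of Section V: it simply extends each line as far as possible while the largest deviation stays at or below a threshold, here called `d_max`. The function names and the sample contour are hypothetical.

```python
from typing import List


def max_line_error(pitch: List[float], start: int, end: int) -> float:
    """Largest absolute gap between the contour and the chord from start to end."""
    span = end - start
    worst = 0.0
    for k in range(start, end + 1):
        line = pitch[start] + (pitch[end] - pitch[start]) * (k - start) / span
        worst = max(worst, abs(pitch[k] - line))
    return worst


def piecewise_linear(pitch: List[float], d_max: float) -> List[int]:
    """Return breakpoint indices such that every segment's error is <= d_max."""
    breakpoints = [0]
    start = 0
    last = len(pitch) - 1
    while start < last:
        end = start + 1  # a two-point segment always has zero error
        # Greedily push the end point forward while the error bound holds.
        while end < last and max_line_error(pitch, start, end + 1) <= d_max:
            end += 1
        breakpoints.append(end)
        start = end
    return breakpoints


# Hypothetical pitch values in Hz, one per frame.
contour = [100.0, 102.0, 104.0, 106.0, 105.0, 104.0, 103.0, 110.0]
tight = piecewise_linear(contour, d_max=0.5)
loose = piecewise_linear(contour, d_max=5.0)
```

A larger threshold yields fewer breakpoints to transmit at the price of a coarser contour, which is the bit rate versus accuracy trade-off the threshold controls.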

This paper is organized as follows. Section II gives an 
overview of our coder. Section III then describes the unit 
selection algorithm. The compression method of the unit 
sequence is presented in Section IV. In Section V we describe 
an efficient pitch contour coding method. Section VI presents 
the experimental results obtained from a single speaker's corpus. 
We then conclude in Section VII with a discussion of the 
significant results and possible extensions. 

II. Overview of the Coder 

In a unit selection-based waveform-concatenating TTS 
scheme, synthesized speech is produced by concatenating the 
waveforms of units selected from a large database [15]. Units 
are selected to produce natural sounding speech of a given 
phoneme sequence predicted from text. This scheme has been 
widely used in several current TTS systems and gives synthetic 
speech that is close to natural. At this point, it can be assumed 
that if we replace parameters from text with those from a given 
speech signal, the resulting speech signal from TTS would 
sound like the input speech signal. This scenario, which is the 
basic scheme of the proposed speech coder, is depicted in more 
detail in Fig. 1. 

We use mel-frequency cepstrum coefficients (MFCCs) as fea- 
ture parameters for the unit selection. MFCCs have been widely 
used in both automatic speech recognition and speaker identification 
tasks. MFCCs as a unit selection parameter can provide 
reasonable intelligibility. In computing mel-cepstrum coefficients, 
a Hanning window of 25 ms at a frame rate of 100 Hz is 
applied. This means the length of each unit is 10 ms. In [19], the 
inclusion of features relevant to prosody increased naturalness 
of the synthesized signal, because decreased prosodic modification 
tends to reduce the artifacts of the synthesized speech. However, 
our experiment showed that introducing prosodic features 
to the selection criteria sometimes produced lower intelligibility 
than the MFCCs-only case. 
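
The framing arithmetic above (25 ms window, 100 Hz frame rate, hence a 10 ms unit) can be checked with a small sketch. The 8 kHz sampling rate, the test tone, and the function names are assumptions for illustration only; the MFCC computation itself is omitted.

```python
import math
from typing import List, Tuple


def hann_window(length: int) -> List[float]:
    """Symmetric Hann (Hanning) window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal: List[float], sample_rate: int,
                 window_ms: float = 25.0,
                 frame_rate_hz: float = 100.0) -> Tuple[List[List[float]], int, int]:
    """Split a signal into overlapping, windowed analysis frames."""
    win_len = int(round(sample_rate * window_ms / 1000.0))  # samples per window
    hop = int(round(sample_rate / frame_rate_hz))           # samples per 10 ms unit
    window = hann_window(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(win_len)])
    return frames, win_len, hop


sample_rate = 8000  # assumed narrow band input
# One second of a 200 Hz tone as a stand-in for speech.
signal = [math.sin(2.0 * math.pi * 200.0 * n / sample_rate)
          for n in range(sample_rate)]
frames, win_len, hop = frame_signal(signal, sample_rate)
```

With these assumed settings each window covers 200 samples and advances by 80 samples, so neighboring windows overlap by 15 ms while each unit still represents 10 ms of speech.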

There are two databases in Fig. 1. The first one is for the 
unit selection process that contains MFCCs obtained in the same 
way as the feature extraction. The second one contains speech 
waveforms or appropriate coding parameters that are used to 
make the output waveform. The raw speech signal from which 
MFCCs of the first database are computed is the same as that 
in the second database. Transmitted information, as shown in 
Fig. 1, is F0, gain, and unit indices. A unit index actually represents 
the position where the selected unit is located in the database. 

Fig. 1. Block diagram of the proposed coder. 

A primary difference from the conventional coder is that 
we do not use any speech generation framework such as a 
source-filter model. We assume that any speech signal can be 
reconstructed by concatenating pitch modified short-segment 
waveforms that are adequately chosen from the large database. 
Another difference is that since the output speech is produced 
from a separate waveform database, the sampling rate of output 
speech is fully independent of the input speech sampling rate. 
That means, even if the input speech is a narrow band signal 
(i.e., $f_s$ = 8 kHz), a wide band signal (i.e., $f_s$ = 16 kHz) can 
be obtained. We used an F0 estimation method that is used in 
the waveform interpolation coder [16]. 

III. Unit Selection 

In this paper, the problem of unit selection is formulated 
as how to find the optimal sequence from a large database 
in the sense of minimizing distortion within an individual 
frame and preserving natural coarticulation. That means there 
should be two cost functions in the unit selection process, 
target cost within an individual frame and concatenation cost 
between frames. When synthesizing a speech signal with an 
input feature vector sequence $X = x_1, x_2, \ldots, x_T$ by one 
from the synthesis database $U = u_1, u_2, \ldots, u_T$, the total 
cost, $C(X, U)$, is defined by summing the acoustic target cost, 
$C_a(x_t, u_t)$, and the concatenation cost, $C_c(u_{t-1}, u_t)$: 

\[ C(X, U) = \sum_{t=1}^{T} \omega(x_t, u_t)\, C_a(x_t, u_t) + \sum_{t=2}^{T} C_c(u_{t-1}, u_t) \tag{1} \]

where 

\[ C_a(x_t, u_t) = \sum_{i=1}^{n} \left( c_{x_t,i} - c_{u_t,i} \right)^2 \tag{2} \]

\[ C_c(u_{t-1}, u_t) = \sum_{i=1}^{n} \left( c_{u_{t-1},i} - c_{u_t,i} \right)^2 \tag{3} \]

where $c_{x_t,i}$ and $c_{u_t,i}$ represent the $i$-th MFCC of the $t$-th input 
frame and the unit $u_t$, respectively, and $n$ is the order of the 
MFCC. In (1), $\omega(x_t, u_t)$ represents a fundamental frequency 
(F0) penalty at time $t$ which increases the cost of selecting 
units with different F0s than the input. A possible penalty is 
given by 

\[ \omega(x_t, u_t) = \alpha \left( 1 + \left| \log \frac{F0_{x_t}}{F0_{u_t}} \right| \right) \tag{4} \]

where $F0_{x_t}$ and $F0_{u_t}$ represent the fundamental frequency of 
the $t$-th input frame and the unit $u_t$, respectively. There is a special 
condition for the concatenation cost: $C_c(u_{t-1}, u_t)$ is defined 
to be zero if $u_{t-1}$ and $u_t$ are consecutive in the database. 
This encourages the selection of consecutive frames in the database 
which have natural coarticulation. In (3), the amount of 
unnaturalness between neighboring frames is assumed to be the 
Euclidean distance between their MFCCs. This is a reasonable 
assumption because a smoothly evolving spectral envelope over 
time increases the reconstructed speech quality [21]. Further improvement 
can be obtained by introducing an auditory-based 
distance measure with application to concatenative speech synthesis 
[22]. The optimal unit sequence $U^*$ is obtained by minimizing 
the total cost $C(X, U)$: 

\[ U^* = \arg\min_{U \in \mathcal{U}^T} C(X, U) \tag{5} \]

where $\mathcal{U}^T$ is the set of all possible sequences that have $T$ units. 
This minimization can be performed by a Viterbi search processing 
one input unit at a time. Let $u_t(i)$ be the $i$-th unit at time 
$t$; the forward recursion is as follows: 

\[ C_t(i) = \min_{1 \le j \le N} \left\{ C_{t-1}(j) + C_c(u_{t-1}(j), u_t(i)) \right\} + C_a(x_t, u_t(i)) \]

\[ \Psi_t(i) = \arg\min_{1 \le j \le N} \left\{ C_{t-1}(j) + C_c(u_{t-1}(j), u_t(i)) \right\} \tag{6} \]

where $1 \le i \le N$, $1 \le t \le T$, $\Psi_t(i)$ is the backtracking 
pointer for the $i$-th unit at time $t$, $C_t(i)$ is the accumulated 
cost for the $i$-th unit at time $t$, and $C_c(u_{t-1}(j), u_t(i)) = 0$ 
if $u_{t-1}(j)$, $u_t(i)$ are consecutive in a database. 
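
The cost definitions in (1) through (4) can be sketched in a few lines. The value of `alpha`, the toy two-coefficient "MFCC" vectors, and the use of database positions (`db_pos`) to detect consecutive units are assumptions made for this example.

```python
import math
from typing import List, Sequence


def target_cost(x: Sequence[float], u: Sequence[float]) -> float:
    """Acoustic target cost (2): squared MFCC distance between input and unit."""
    return sum((a - b) ** 2 for a, b in zip(x, u))


def concat_cost(u_prev: Sequence[float], u_cur: Sequence[float],
                consecutive: bool) -> float:
    """Concatenation cost (3), defined as zero for consecutive database units."""
    if consecutive:
        return 0.0
    return sum((a - b) ** 2 for a, b in zip(u_prev, u_cur))


def f0_penalty(f0_x: float, f0_u: float, alpha: float = 1.0) -> float:
    """F0 penalty (4); alpha = 1.0 is an assumed value."""
    return alpha * (1.0 + abs(math.log(f0_x / f0_u)))


def total_cost(X: List[Sequence[float]], U: List[Sequence[float]],
               f0_X: List[float], f0_U: List[float],
               db_pos: List[int], alpha: float = 1.0) -> float:
    """Total cost (1) of synthesizing inputs X with the unit sequence U."""
    cost = 0.0
    for t in range(len(X)):
        cost += f0_penalty(f0_X[t], f0_U[t], alpha) * target_cost(X[t], U[t])
        if t > 0:
            consecutive = db_pos[t] == db_pos[t - 1] + 1
            cost += concat_cost(U[t - 1], U[t], consecutive)
    return cost


# Toy data: two frames with two coefficients each and matching F0 values.
X = [[1.0, 2.0], [2.0, 3.0]]
U = [[1.0, 2.0], [2.0, 2.0]]
f0 = [120.0, 125.0]
scattered = total_cost(X, U, f0, f0, db_pos=[10, 57])
adjacent = total_cost(X, U, f0, f0, db_pos=[10, 11])
```

The same two units cost less when they sit next to each other in the database, because the concatenation term drops out.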

After the final accumulated costs $C_T(i)$ for all $i$ 
have been computed, the best unit sequence $U^* = 
u_1(q_1), u_2(q_2), \ldots, u_T(q_T)$ is obtained using the following 
backward recursion: 

\[ q_T = \arg\min_{1 \le i \le N} C_T(i) \]

\[ q_t = \Psi_{t+1}(q_{t+1}), \quad t = T-1, T-2, \ldots, 1. \tag{7} \]

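
The forward recursion (6) and backward recursion (7) can be sketched as a small Viterbi search. The scalar "features", the six-entry database, and the cost callables below are hypothetical stand-ins; a real system would use MFCC vectors, the F0 penalty, and pruned candidate lists.

```python
from typing import Callable, List


def viterbi_select(inputs: List[float],
                   candidates: List[List[int]],
                   target_cost: Callable[[float, int], float],
                   concat_cost: Callable[[int, int], float]) -> List[int]:
    """Return the unit sequence with minimum accumulated cost."""
    T = len(inputs)
    acc = [[target_cost(inputs[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for t in range(1, T):
        acc_t, back_t = [], []
        for u in candidates[t]:
            # Forward recursion: best predecessor j for unit u at time t.
            trans = [acc[t - 1][j] + concat_cost(p, u)
                     for j, p in enumerate(candidates[t - 1])]
            best_j = min(range(len(trans)), key=trans.__getitem__)
            acc_t.append(trans[best_j] + target_cost(inputs[t], u))
            back_t.append(best_j)
        acc.append(acc_t)
        back.append(back_t)
    # Backward recursion: follow the backtracking pointers from the best end.
    q = min(range(len(acc[-1])), key=acc[-1].__getitem__)
    path = [q]
    for t in range(T - 1, 0, -1):
        q = back[t][q]
        path.append(q)
    path.reverse()
    return [candidates[t][path[t]] for t in range(T)]


# Hypothetical database of scalar features indexed by database position.
db = [0.0, 1.0, 2.0, 5.0, 1.1, 2.1]


def toy_target(x: float, u: int) -> float:
    return (x - db[u]) ** 2


def toy_concat(prev: int, cur: int) -> float:
    # Zero cost when the two units are consecutive in the database.
    return 0.0 if cur == prev + 1 else (db[prev] - db[cur]) ** 2


inputs = [0.0, 1.05, 2.05]
all_units = list(range(len(db)))
best = viterbi_select(inputs, [all_units] * len(inputs), toy_target, toy_concat)
```

In this toy case the search prefers the consecutive run of units 0, 1, 2 over the slightly closer but non-adjacent units 4 and 5, because the zero concatenation cost outweighs the small gain in target cost.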
Another criterion is to find the best sequence in the sense of 
maximizing the number of consecutive frames as well as mini- 
mizing the cost function. Before computing costs in (6), we find 
the maximum accumulated number of consecutive frames and 
which paths have this maximum number. Finding a minimum 
cost path is then performed by the above forward recursion. This 
can be implemented by introducing an accumulated number of 
consecutive frames, $L_t(i)$ for the $i$-th candidate at time $t$, to the 
Viterbi decoding. The modified forward recursion is as shown 
in (8), where $1 \le i \le N$, $1 \le t \le T$, 
$l_t(i, j)$ is the accumulated number of the consecutive frames 
up to unit $u_t(i)$, and a set $A_t(i)$ contains previous unit indices 
which have the maximum consecutive frames up to unit $u_t(i)$. 
This significantly improves the performance of the coder in 
quality and bit rates, since longer consecutive frames preserve 
natural coarticulation speech, and the efficiency of the subse- 
quent run-length coding is increased when the number of con- 
secutive frames is long. 

In the above equations, any unit is assumed to be chosen from 
$N$ units in the database. Because of the large size of the database 
(in this work, about 460 000, so N = 460 000) the Viterbi search 
must be pruned to reduce the computational time. A pruning 
strategy will be described in the following sections. 

A. VQ-Based Candidate Selection 

In TTS, the number of possible units at a time is limited by the 
phoneme identification. A similar approach is employed in this 
paper: we focus on the limited number of units whose spectral 
envelope is relatively close to that of the input frame. Since a set of 
the units close to the input frame occupies only a small portion of 
the entire database space, this can significantly reduce the computational 
complexity. This process requires partitioning the entire 
database space. We used vector quantization (VQ) for clustering. 
Supposing that a given unit is vector quantized by a specific 
code vector, the units quantized into the same code vector in 
a database are selected as candidate frames. If each frame corresponds 
to a phonetic inventory, the codebook size is six bits, or 
64, which is nearest the number of phonemes, 51. An experiment 
showed that this number is too small to represent the variability of 
the frames and results in poor performance. Experimentally, we 
obtained good results when the codebook size was 10 or 11 bits. 

This simple method has a problem due to the hard-clustering 
property of VQ [21]. As described in Fig. 2, when an input unit 
is close to a border of the space partitioning, more adequate candidate 
units may not be selected. To alleviate this, it is necessary 
to choose more than one cell. Well-known soft clustering 
techniques, such as Gaussian mixture model (GMM) or fuzzy 
clustering, can be considered to choose the multiple cells. In our 
method, a relative distance measure is used. The Euclidean distance 
between the input and each VQ centroid is computed and 
reordered. Then, the $i$-th cell is selected if the ratio $d_{\min}/d_i$ is 
greater than a given threshold (typically 0.7), where $d_i$ is the 
distance between the input and the centroid of the $i$-th cell and 
$d_{\min}$ is the minimum of $d_i$. 

Fig. 2. VQ-based candidate selection. Three cells are the 
candidate cells in this example. 
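
The relative distance test can be sketched as follows. This assumes the ratio is $d_{\min}/d_i$, so that the nearest cell always passes, and it applies the cap of six candidate cells mentioned in the text; the centroids and the input vector are made up for the example.

```python
import math
from typing import List, Sequence


def select_candidate_cells(x: Sequence[float],
                           centroids: List[Sequence[float]],
                           ratio_threshold: float = 0.7,
                           max_cells: int = 6) -> List[int]:
    """Pick VQ cells whose centroid is nearly as close as the nearest one."""
    dists = [math.dist(x, c) for c in centroids]
    order = sorted(range(len(centroids)), key=dists.__getitem__)
    d_min = dists[order[0]]
    selected = []
    for i in order:
        # Guard against division by zero when the input sits on a centroid.
        ratio = 1.0 if dists[i] == 0.0 else d_min / dists[i]
        if ratio > ratio_threshold:
            selected.append(i)
    return selected[:max_cells]


# Hypothetical 2-D centroids; the input lies near the border of cells 0 and 1.
centroids = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 1.2)]
cells = select_candidate_cells((0.45, 0.0), centroids)
```

Because the input is almost equidistant from the first two centroids, both cells are kept, which is the situation hard clustering alone would miss.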

The number of candidate units depends on the number of can- 
didate cells. In order to keep computational complexity from in- 
creasing greatly, we limit the number of candidate cells to six. 
The final procedure for selecting candidate units is to pick units 
within a hyper sphere with the input vector as a center. The ra- 
dius determines the maximum allowable error of the candidate 
selection, which is closely related to the number of candidates. 
We determine the maximum allowable error by the bisectional 
search algorithm which is known as a fast search algorithm. 
The algorithm finds the maximum allowable error iteratively 
until the number of candidates reaches the desired number, A r <£. 
Let us describe the algorithm for taking N d candidates within a 
threshold Thres. 

1) Compute the acoustic target costs, $C_a(x_i, u_i)$, of all the
candidates within a candidate region.

2) Set initial Thres, $N_d$ and ΔThres = 0.5 * Thres.

3) Count $N$ = the number of candidates whose
$C_a(x_i, u_i)$ < Thres.

4) If $N < N_d$ at the first iteration or $|N - N_d| < \delta$, stop
iteration.

5) If $N > N_d$, Thres = Thres - ΔThres; otherwise,
Thres = Thres + ΔThres.

6) ΔThres = 0.5 * ΔThres, go to step 3.

Note that since ΔThres decreases exponentially, only a small
number of iterations is required in this procedure. This means
that the above method is much faster than a full sorting-based
selection method.
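
The six steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the name `bisection_threshold`, the tolerance parameter `tol` (standing in for the stopping margin in step 4), and the `max_iter` safety bound are not from the paper.

```python
def bisection_threshold(costs, n_desired, thres, tol=5, max_iter=50):
    """Adjust thres until about n_desired costs fall below it.

    costs     : acoustic target costs of the candidates (step 1)
    n_desired : desired number of candidates
    thres     : initial threshold (step 2)
    tol       : stopping margin on the candidate count (assumed value)
    Returns the final threshold and the indices of the kept candidates.
    """
    delta = 0.5 * thres                                  # step 2
    for iteration in range(max_iter):                    # max_iter: safety bound
        n = sum(1 for c in costs if c < thres)           # step 3
        if (iteration == 0 and n < n_desired) or abs(n - n_desired) < tol:
            break                                        # step 4
        if n > n_desired:                                # step 5
            thres -= delta
        else:
            thres += delta
        delta *= 0.5                                     # step 6
    selected = [i for i, c in enumerate(costs) if c < thres]
    return thres, selected
```

Because the step size halves on every pass, the loop needs only a handful of counting passes over the costs and never sorts them.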



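[Fragment of a piecewise definition; only its two case labels survive in the source: "... are consecutive in a database" and "0, otherwise".]

$$\psi_t(i) = \arg\min_{1 \le j \le N} \{ C_{t-1}(j) + C_c(u_{t-1}(j), u_t(i)) \}$$

$$C_t(i) = \min_{1 \le j \le N} \{ C_{t-1}(j) + C_c(u_{t-1}(j), u_t(i)) \} + C_a(x_t, u_t(i)) \qquad (8)$$
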
B. Context-Based Viterbi Pruning 

The typical number of candidates after VQ-based candidate
selection ranges from 500 to 1000. This number corresponds to
only 0.1% or 0.2% of all the units in the database. However,
the Viterbi decoding process still requires a large amount of computation.

The pruning strategy of this section is based on a contextually
meaningful criterion. Since the proposed speech coder
uses a TTS database, which is already phonetically labeled, we
can predict whether a given path is possible in some context.
For example, assuming that a database is labeled using half
phones, the frame following a current frame labeled
"aa1" (aa1 is the first half of phoneme aa) must be labeled
"aa1" or "aa2" (aa2 is the last half of phoneme aa). All other
combinations, like aa1-ae1 or aa1-k2, must be removed. In this
way, we can reduce the number of paths in the Viterbi algorithm.
Experimental results showed that approximately 50% of
the total number of paths have contextually legal combinations
of phonemes. This means that, by using this pruning, the amount of
computation for the concatenation cost can be reduced by 50%. The
sound quality after this pruning process was almost the same as or
somewhat better than that of the method without pruning. In
terms of computational complexity and memory size, this
pruning process requires just one character comparison and no
additional memory.
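
A minimal sketch of this label check is given below. The function names are hypothetical, and labels are assumed to be a phoneme name followed by "1" or "2". The rule for the first half follows the example in the text; the rule for the second half (stay in it, or move to the first half of any phoneme) is an assumption, since the text gives only the first-half example.

```python
def is_legal_transition(prev_label, next_label):
    """Decide whether next_label may follow prev_label on a Viterbi path."""
    prev_phone, prev_half = prev_label[:-1], prev_label[-1]
    next_phone, next_half = next_label[:-1], next_label[-1]
    if prev_half == "1":
        # From the text: "aa1" may be followed only by "aa1" or "aa2".
        return next_phone == prev_phone
    # Assumed rule for a second half: repeat it or start a new phoneme.
    return next_label == prev_label or next_half == "1"

def prune_transitions(prev_candidates, next_candidates):
    """Return the (j, i) index pairs whose label combination is legal."""
    return [
        (j, i)
        for j, prev_label in enumerate(prev_candidates)
        for i, next_label in enumerate(next_candidates)
        if is_legal_transition(prev_label, next_label)
    ]
```

Only the surviving pairs need a concatenation cost, which is where the reported saving in computation comes from.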

IV. Coding the Selected Unit Sequence 

Since the concatenation cost in (1) is set to zero if two frames 
are consecutive in a database, the resulting unit sequence has 
many consecutive frames. To take advantage of this property, a 
run-length coding technique is employed to compress the unit 
sequence. In this method, a series of consecutive frames are rep- 
resented with the start frame index and the number of the fol- 
lowing consecutive frames as shown in Fig. 3. Thereby a number 
of consecutive frames are encoded into only two variables. 

The coding efficiency of the example in Fig. 3 is (437 -
198)/437 * 100 = 54.7%. In this example, we assigned 19 bits
to the start frame index because $\log_2 460\,\mathrm{K} \approx 19$ (bits). However,
because the possible units are limited by the phone index of the
previous frame, as described in Section III-B, the actual number of
possible units is less than the total number of units. The number
of bits for a start frame index is therefore determined according to the
phone identification of the last frame. For example, if the last
frame has the phone "aa1" and the number of occurrences of aa1
in the database is 10 K, the required number of bits for the following frame
index is $\log_2 10\,\mathrm{K} \approx 14$ (bits), instead of 19 bits.
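
The run-length scheme and the bit counts quoted above can be checked with a short sketch. The function names are hypothetical, and the 3-bit run-length field is inferred from the 22-bit total per run in Fig. 3 (19 bits for the start index plus 3 bits); the frame indices are the example sequence from Fig. 3.

```python
import math

def run_length_encode(frame_indices):
    """Collapse runs of consecutive frame indices into (start, length) pairs."""
    runs = []
    for idx in frame_indices:
        if runs and idx == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1          # idx continues the current run
        else:
            runs.append([idx, 1])     # idx starts a new run
    return [tuple(run) for run in runs]

def run_length_decode(runs):
    """Expand (start, length) pairs back into the frame index sequence."""
    return [start + k for start, length in runs for k in range(length)]

def index_bits(num_units):
    """Bits needed to address one of num_units units."""
    return math.ceil(math.log2(num_units))

FIG3_SEQUENCE = [
    189014, 189015, 189016, 454609, 154299, 369469, 369470, 369471,
    417562, 417563, 417564, 417565, 417567, 417568, 417569, 417570,
    417571, 335566, 335567, 335568, 368826, 368827, 20204,
]

runs = run_length_encode(FIG3_SEQUENCE)
bits_before = index_bits(460_000) * len(FIG3_SEQUENCE)   # 19 bits per frame
bits_after = (index_bits(460_000) + 3) * len(runs)       # assumed 3-bit length field
efficiency = 100.0 * (bits_before - bits_after) / bits_before
```

For the Fig. 3 sequence this gives 9 runs, 437 bits before coding and 198 bits after, which matches the 54.7% efficiency stated in the text; `index_bits(10_000)` gives the 14 bits quoted for the phone-restricted index.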

As for the bits for quantizing the length of a run of consecutive units,
which is referred to as the run-length in this paper, a variable bit
allocation proved to be more efficient than a fixed bit allocation.
Experimental results showed that about 30 b/s were saved by
using Huffman coding. This is explained by the histogram of the
run-length in Fig. 4. The corresponding Huffman code table is
also shown in Fig. 4. As shown, fewer bits are
allocated to the shorter run-lengths.

V. Coding the F0 Contour 

In order to get a high compression ratio, our F0 coding is
contour-wise rather than frame-wise. Piecewise linear approximation
(PLA) [12], as shown in Fig. 5, is used to implement
our contour-wise F0 coding. PLA is well suited to
high compression, because we need to transmit only a small
number of sampled points instead of all individual samples. Of
course, the intervals between the sampled points must also be
transmitted for proper interpolation. In general, the total number of
bits for PLA is smaller than for frame-wise coding.

Bit fields: [Start Frame No.] [The number of the following consecutive frames]

A unit sequence (before coding)

189014 189015 189016 454609 154299 369469
369470 369471 417562 417563 417564 417565
417567 417568 417569 417570 417571 335566
335567 335568 368826 368827 20204
(Total 19 * 23 = 437 bits)

After run-length coding

189014 3   454609 1   154299 1   369469 3   417562 4
417567 5   335566 3   368826 2   20204 1
(Total 22 * 9 = 198 bits)

Fig. 3. Bit fields for a (top) consecutive unit sequence and (bottom) an
example.

Run length    Code
1             000
2             10
3             01
4             110
5             1110
6             0010
7             11110
8             11111
9             00110
10            001110
11            onmio
12            001111000
13            00111111
15            001 III 101
IS            ootmiooit
Escape        ooimiooio

Fig. 4. (left) Histogram of run-length and (right) its Huffman code table.

PLA always presumes some degree of smoothness in
the function being approximated. Therefore, we apply a median
smoothing filter to the F0 contour before compressing it.
The coarse representation of the F0 contour by piecewise linear
approximation causes larger coding errors than frame-wise
coding. This error depends on how the F0 samples are selected as
endpoints of the approximation lines. Therefore, an optimized
PLA is formulated that finds the locations of the F0 points by
minimizing the error between the F0 contour and the approximation.
Two methods for finding the locations of the F0 points are
proposed in this work. In the following sections, we discuss
these issues in more detail.

A. Successive Linear Approximation 

The method introduced in this section is close to the polygon
approximation algorithm [18] used in image coding applications,
which was developed for efficient compression of two-dimensional
(2-D) polygons. Successive approximation for F0
coding can be thought of as a one-dimensional (1-D) version
of the polygon approximation.

Fig. 5. Piecewise linear approximation. (Legend: original F0 contour; linear approximation.)

Fig. 6 depicts the framework of the successive linear approximation
for the F0 contour. Linear approximation is first carried out
using the two contour points with the maximum error between
them as the starting points. Then, additional points are added to
the line where the error between the approximation and the contour
is maximum. This is repeated until the contour approximation
error is less than $d^{*}_{\max}$. The resulting approximated contour
guarantees that the approximation error is below $d^{*}_{\max}$.

This method considers only the instantaneous error. However,
the mean squared error is sometimes a more meaningful criterion
than the instantaneous error. Overlooking this measure causes
a larger mean squared error in some regions, even if a small $d^{*}_{\max}$
is met. To alleviate this problem, we modified the successive
approximation mentioned above to achieve better performance.
In the modified method, the approximation is carried out
according to the following steps.

1) Compute the mean square error for each line, and find 
the line with maximum mean squared error among all 
approximation lines. 

2) For the line with maximum mean squared error, pick the 
point with maximum error between the original contour 
and the approximated line. 

3) If the maximum error is greater than $d^{*}_{\max}$, go to Step 4);
otherwise stop the approximation.

4) Add the point from Step 2) to the line from Step 1), and go to
Step 1).

Note that the mean squared error criterion is used to preselect
the F0 point for linear approximation. As a result,
regions with high fluctuation are the ones that are subsequently
refined by the piecewise linear approximation. According to experiments, even with the
same threshold, the mean squared error over the whole F0
contour is further reduced by the modified algorithm.
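
The modified procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the contour is a list of F0 values on a uniform frame grid with at least two samples, and the initial approximation is assumed to be a single line from the first to the last sample.

```python
def line_errors(f0, a, b):
    """Absolute errors between the contour and the chord from sample a to b."""
    slope = (f0[b] - f0[a]) / (b - a)
    return [abs(f0[a] + slope * (t - a) - f0[t]) for t in range(a, b + 1)]

def modified_successive_pla(f0, d_max):
    """Return the sorted sample indices used as endpoints of the lines."""
    points = [0, len(f0) - 1]        # assumed initial line: first to last sample
    while True:
        # Step 1: find the line with the largest mean squared error.
        worst, worst_mse = None, -1.0
        for a, b in zip(points, points[1:]):
            errs = line_errors(f0, a, b)
            mse = sum(e * e for e in errs) / len(errs)
            if mse > worst_mse:
                worst, worst_mse = (a, errs), mse
        a, errs = worst
        # Step 2: point of maximum absolute error on that line.
        offset = max(range(len(errs)), key=errs.__getitem__)
        # Step 3: stop once that maximum error is within the threshold.
        if errs[offset] <= d_max:
            return points
        # Step 4: split the line at that point and repeat.
        points.append(a + offset)
        points.sort()
```

For example, a triangular contour `[100, 110, 120, 130, 120, 110, 100]` with a 5 Hz threshold is split once at the peak, giving the endpoints `[0, 3, 6]`.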

Determining the threshold error $d^{*}_{\max}$ is crucial,
as this value affects both the number of bits and the perceptual
quality. During subjective evaluations of synthesized
speech signals, it was found that allowing a maximum error
of $d^{*}_{\max} = 5$ Hz for a female talker is sufficient to allow
proper representation of the F0 contour while obtaining a
reasonable bit rate.

The B-spline approximation for the F0 contour was also considered
in this work. Visual inspection revealed that the contour
approximated by B-spline was closer to the original F0 contour
due to its smoother representation of the contour. However,
there was no clear perceptual difference. Hence, we concluded
that linear approximation is good enough for representing the F0
contour.

Fig. 6. Successive linear approximation for F0 contour.

B. Linear Approximation Based on Rate-Distortion Criterion 

In this section, we propose an optimal method that takes into 
account not only the approximation error but also the number 
of bits. The method is implemented based on rate distortion cri- 
teria. 

Let $P = \{p_0, \ldots, p_{N_P-1}\}$ denote the set of F0 points used to
approximate the contour, which is also an ordered set, with $N_P$
the total number of F0 points in $P$, and the $k$th line starting at
$p_{k-1}$ and ending at $p_k$. Since $P$ is an ordered set, the ordering
rule and the set of points uniquely define the approximated contour.

Now, we define a constrained minimization problem

$$\text{Minimize } R(P) \text{ subject to } D_{\max}(P) \le d^{*}_{\max} \qquad (9)$$

where $R(P)$ is the total number of bits needed to encode the
F0 set $P$, including values and positions, and $D_{\max}(P)$ is the
overall maximum absolute error defined by

$$D_{\max}(P) = \max_{k \in [1, \ldots, N_P-1]} d_{\max}(p_{k-1}, p_k) \qquad (10)$$

where $d_{\max}(p_{k-1}, p_k)$ is the maximum absolute error between
the line $p_{k-1}$ to $p_k$ and the actual F0 values. Note that there is an
inherent tradeoff between $R(P)$ and $D_{\max}(P)$ in the sense that
a small $D_{\max}(P)$ requires a high $R(P)$, whereas a small $R(P)$
results in a high $D_{\max}(P)$.

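To find an easier way to solve the problem, we rewrite $R(P)$
in (9) as follows:

$$R(P) = \sum_{k=1}^{N_P-1} w(p_{k-1}, p_k) \qquad (11)$$

where

$$w(p_{k-1}, p_k) = \begin{cases} \infty, & d_{\max}(p_{k-1}, p_k) > d^{*}_{\max} \\ r(p_{k-1}, p_k), & \text{otherwise} \end{cases} \qquad (12)$$

where $r(p_{k-1}, p_k)$ is the number of bits needed to encode the line
$p_{k-1}$ to $p_k$.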

Now, the problem can be formulated in the form of a directed
graph, as shown in Fig. 7. The vertices of the graph correspond
to the admissible F0 points, and the edges correspond to the
possible segments of the approximation line. The edges have
weights $w(p_{k-1}, p_k)$. The total number of bits $R(P)$ is proportional
to the number of points $N_P$. Thus the problem can be
considered as a problem of finding a shortest path. Note that
the above definition of the weight function leads to a length
of infinity for every path that includes a line segment resulting
in an approximation error larger than $d^{*}_{\max}$. We can find an
optimal path by an exhaustive search over all possible sets $P =
\{p_0, \ldots, p_{N_P-1}\}$. However, this is not practical because
of its very expensive computational cost. As an alternative, dynamic
programming is employed. It first finds a "local minimal
path" for all F0 points within a syllabic contour; then the global
minimum path is built by backtracking. The overall procedure
for finding an optimal set of F0 points $P^{*} = \{p^{*}_0, \ldots, p^{*}_{N_P-1}\}$
is as follows:

Fig. 7. Example of directed graph for the linear approximation. The bold lines
indicate the local minimal paths.

$$w(n) = \arg\min_{0 \le k < n} \{ \alpha_{k,n} [ R(p_k) + r(p_k, p_n) ] \}$$

$$R(p_n) = R(p_{w(n)}) + r(p_{w(n)}, p_n)$$

$$D_{\max}(p_n) = \max\{ D_{\max}(p_{w(n)}), d_{\max}(p_{w(n)}, p_n) \} \qquad (13)$$

where $1 \le n \le N - 1$ and $N$ is the total number of F0 samples
within a syllabic contour, $R(p_k)$ is the accumulated number of
bits up to $p_k$; similarly, $D_{\max}(p_k)$ is the maximum error up to
$p_k$, and $\alpha_{k,n}$ is given by

$$\alpha_{k,n} = \begin{cases} \infty, & d_{\max}(p_k, p_n) > d^{*}_{\max} \\ 1, & \text{otherwise.} \end{cases}$$

The backtracking pointer $w(n)$ indicates which F0
point is the start point of the path with the minimum accumulated
number of bits at $p_n$. The optimal sequence of F0 points
in reverse order is

$$p_N,\ p_{w(N)},\ p_{w(w(N))},\ \ldots$$

An example is given in Fig. 7. For each point $p_n$, the bold line
denotes the local minimal path ($= w(n)$). It can be easily seen
that the optimal set of F0 points after backtracking is
given by P* = {p^P^Pl}' 
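
The dynamic programming recursion with backtracking can be sketched as below. This is a minimal illustration, not the authors' implementation: every F0 sample is treated as an admissible point, and each transmitted point is assumed to cost a constant `bits_per_point` (13 bits, taken from the 8-bit value and 5-bit duration of Table I). The function names are hypothetical.

```python
import math

def max_abs_error(f0, a, b):
    """Maximum absolute error between the contour and the chord a -> b."""
    slope = (f0[b] - f0[a]) / (b - a)
    return max(abs(f0[a] + slope * (t - a) - f0[t]) for t in range(a, b + 1))

def optimal_pla(f0, d_max, bits_per_point=13):
    """Find the fewest-bit set of F0 points whose lines stay within d_max.

    Returns (points, total_bits), where points are sample indices.
    """
    n_samples = len(f0)
    bits = [math.inf] * n_samples    # accumulated bits up to each point
    back = [-1] * n_samples          # backtracking pointer for each point
    bits[0] = bits_per_point         # the first point is always sent
    for n in range(1, n_samples):
        for k in range(n):
            # An edge whose error exceeds d_max has infinite weight: skip it.
            if bits[k] == math.inf or max_abs_error(f0, k, n) > d_max:
                continue
            cost = bits[k] + bits_per_point
            if cost < bits[n]:
                bits[n], back[n] = cost, k
    # Backtrack from the last sample to recover the optimal point set.
    points, n = [], n_samples - 1
    while n != -1:
        points.append(n)
        n = back[n]
    points.reverse()
    return points, bits[-1]
```

An edge between adjacent samples always has zero error, so a path to the last sample always exists. With a constant cost per point, minimizing the bits is the same as minimizing the number of points, which matches the remark that the total number of bits is proportional to the number of points.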

VI. Experiments and Results 

This section presents the experimental results of the proposed
coder for a single female speaker. The size of the database
in our work is about 76 min, which corresponds to 460 K
units. For the test, we also prepared 15 test sentences from
the same talker.^1 All speech signals were recorded at 48 kHz
in a noise-free environment, low-pass filtered to 7 kHz,
and then down-sampled to 16 kHz. Twenty-one MFCCs including
the zeroth coefficient were computed for unit selection. A
pre-emphasis factor of 0.95 is applied, and the number of
mel-frequency filter banks is 24.

^1 The database does not include these test speech signals.

First, we evaluate the performance of the proposed F0 coding
method. Fig. 8 shows the original F0 contour versus the approximated
contours for $d^{*}_{\max} = 5$ Hz and $d^{*}_{\max} = 10$ Hz, respectively.
The results in the figure were obtained with the rate-distortion
criterion presented in Section V-B. A coarser representation
of a given F0 contour is obtained at a higher $d^{*}_{\max}$ value. In practice,
setting the maximum allowable error $d^{*}_{\max}$ to 5-6 Hz results
in a perceptually good approximation of a female voice's
F0 contour. We also encoded a number of F0 contours from the
15 sentences and averaged the resulting bit rates. The bit allocation
for F0 information is summarized in Table I. The experiments
were performed for $d^{*}_{\max} = 1, 2, \ldots, 11$. The
results are shown in Fig. 9. The shape of the resulting curve
resembles a typical rate-distortion curve even though there
is no explicit relationship between bit rate and $d^{*}_{\max}$. For the successive
approximation case in Section V-A, the results are almost
the same as for the rate-distortion criterion, but the bit rate is
slightly higher (135.9 b/s for the successive approximation
method and 120.5 b/s for the rate-distortion-based method).

The average bit rate for each parameter is summarized in
Table II. This result is also based on the 15 test sentences, and
the bit rate for F0 is from the method based on rate distortion
with $d^{*}_{\max} = 6$ Hz. The bits for gain information were determined
according to the method described in [24]; however, this
method originally required phonetic segmentation, which is not
available in this work. Hence, we used a simple segmentation
method based on the voiced/unvoiced decision and the
first-order orthogonal polynomial coefficients for MFCCs. The
threshold for detecting segment boundaries was determined
heuristically so that it produces the same number of boundaries as
the phoneme boundaries. In the unit selection process, the modified forward
recursion (8) and all the pruning methods described in Section III
were used. As shown in Table II, more than 60% of the
total b/s is occupied by the frame index. This is because the large
size of the database entails more bits. A subjective listening test
as a function of the size of the database will give useful clues to
help decrease bit rates.
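
As a quick check of the share attributed to the frame index, the four per-parameter rates listed in Table II can be summed directly. The symbols $R_{\text{total}}$ and $R_{\text{frame index}}$ are introduced here only for this calculation.

```latex
\begin{align*}
R_{\text{total}} &= 521.7 + 91.7 + 120.5 + 99.5 = 833.4\ \text{b/s},\\
\frac{R_{\text{frame index}}}{R_{\text{total}}} &= \frac{521.7}{833.4} \approx 0.626 .
\end{align*}
```

The frame index therefore accounts for roughly 63% of the total, consistent with the statement that it occupies more than 60% of the bit rate.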

There are several ways to synthesize speech waveforms from
the selected unit sequence, such as PSOLA [25], HNM [14],
and MBROLA [26]. Among them, HNM-based synthesis gives
good performance for prosody modification as well as concatenation,
due to its parametric modeling approach. Hence, we
adopted it for waveform synthesis. Since the HNM synthesis is
pitch synchronous, there is time misalignment between the selected
frame unit sequence and the HNM parameter sequence.
Indeed, the female voice has a generally higher pitch and this
leads to insufficient frame information when frames represent
10-ms intervals. Copying or deleting HNM parameters may be
a solution for this problem. However, this causes annoying discontinuities
and buzziness in the synthesized speech signal. In
order to minimize quality loss at the synthesis stage, we employed
a multimodal interpolation technique that applies different
kinds of interpolation methods according to the characteristics
of the frame joining points. For example, if two frames are
not naturally concatenated (in other words, the frame indices of
the two frames are not consecutive), the HNM parameters of the
intermediate frames are obtained by interpolating those of the
neighboring frames. It is well known that a high degree of discontinuity
can be expected when the speech signal changes from
unvoiced to voiced and vice versa. In other words, preserving
the discontinuities at the points where the voicing status changes provides
more natural sounding speech. Nearest neighbor search is used
to find the HNM parameters at joining points where the voicing
states of the neighboring frames are different from each other. Note
that MBROLA uses a constant frame length at synthesis time; this
feature would reduce the complexity of HNM synthesis.

TABLE I
Bit Allocation of F0 Information

Parameters       Allocated bits
F0 (value)       8
F0 (duration)    5
Gain             5
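
The two interpolation modes described above can be illustrated with a small sketch. It is only a sketch under stated assumptions: the HNM parameters of a frame are treated as a plain list of numbers, linear interpolation is assumed for the case where the voicing state does not change (the text does not name the interpolation type), and the function name is hypothetical.

```python
def fill_intermediate_frames(left, right, n_frames, left_voiced, right_voiced):
    """Return n_frames parameter vectors placed between two joined frames.

    left, right : parameter vectors of the neighboring frames
    *_voiced    : voicing state of each neighboring frame
    """
    frames = []
    for m in range(1, n_frames + 1):
        t = m / (n_frames + 1)       # relative position between the neighbors
        if left_voiced != right_voiced:
            # Voicing changes: copy the nearest neighbor so that the
            # discontinuity at the voicing boundary is preserved.
            source = left if t < 0.5 else right
            frames.append(list(source))
        else:
            # Same voicing state: assumed linear interpolation.
            frames.append([(1 - t) * a + t * b for a, b in zip(left, right)])
    return frames
```

Between two voiced frames the intermediate parameters move gradually from one neighbor to the other, while across a voiced/unvoiced boundary each intermediate frame simply takes the parameters of the closer neighbor.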

A subjective formal listening test was conducted to compare
the speech quality of the unit selection-based waveform concatenation
and conventional speech coders. The modified forward
recursion produced much better sound quality than the original
forward recursion (6); thus, we used the results from the modified
method as test speech signals. Since the goal of this work
is to produce synthetic speech whose quality is comparable to
conventional low bit rate coders, overall user acceptability of the
reconstructed speech has been measured with a comparison category
rating (CCR) test [17]. The listeners identify the quality
of the second stimulus relative to the first using a two-sided
rating scale, as shown in Table III. Thirteen listeners participated
and were asked to judge which stimulus is better or worse than
the other. Each stimulus consisted of the 8-kHz downsampled
reconstructed speech from the proposed coder and the reconstructed
speech from the 2400 b/s MELP coder [27]. The speech
was from the test data set. The contents of the test sentences
are listed in Table IV. There are five different contents. Each
sentence was uttered three times with three different prosodies.
Thus, a total of 3 * 5 = 15 stimuli were evaluated by each listener.
The average CCR was -0.28, the maximum CCR was 0.33 and
the minimum CCR was -0.64. For all five sentences, the magnitude of the CCR
is less than 1. This means the quality of the proposed coder
is close to that of the 2400 b/s MELP coder. The listeners indicated
that the distortions caused by the two speech coders sound
different from each other. This is due to the fundamentally different
approaches of the two coders. The major factors
of quality degradation of the proposed coder are large distortion
between the input cepstrum and the one from the selected
unit, pitch modification, and interpolation of HNM parameters.

Fig. 9. Rate versus maximum distortion curve.

TABLE II
Average Values of Bit Rate for Each Parameter

Parameters     Bit/s
Frame index    521.7
Run length     91.7
F0             120.5
Gain           99.5
Total          833.4



TABLE III
Quality Rating Scale for a CCR Test

Description       Rating
Much better       2
Better            1
About the same    0
Worse             -1
Much worse        -2



TABLE IV
CCR for Each Test Sentence

Sentence                                         Rating
"Two boyscouts stood watch outside."             -0.28
"Candy's purple gown looks awful."               -0.18
"I'm waiting for my pear tree to bear fruit."    -0.59
"We must complete every task."                   -0.64
"He ate too much corn at the picnic."            0.33
Average                                          -0.28



Noisy or unclear qualities were sometimes found in unvoiced
regions. Slight audible discontinuities were also found in the
speech signal from the proposed coder even though a concatenation
cost is used in unit selection. These defects were more noticeable
when comparing with the original 16-kHz sampled speech
signals. However, according to the CCR scores, it appears that the
overall quality of the reconstructed speech signals is reasonable in
both intelligibility and naturalness.

VII. Conclusion 

A very low bit rate speech coder based on a new paradigm is 
proposed in this paper. The objective of this work is to make the 
quality of a speech coder operating at below 1000 b/s close to 
that of conventional low rate coders. The unit selection approach,
which has been widely used in TTS systems, is a key part of the encoder.
An acoustic target cost function related to intelligibility and
a concatenation cost related to naturalness are applied to unit se- 
lection. A technique which can provide longer consecutive frames 
is also introduced in order to increase sound quality as well as 
coding efficiency. Two pruning methods in a Viterbi decoder are 
introduced to reduce computation times. At the decoder, wave- 
form concatenation and prosody modification are exploited to ob- 
tain the reconstructed speech signal. As a synthesis method, the 
HNM framework is used. Using MFCCs in unit selection was 
motivated by automatic speech recognition and speaker identi- 
fication. As for F0 coding, we introduced linear approximation 
schemes in order to get an extremely low bit rate. A rate-distor- 
tion criterion is applied to the linear approximation. Using this 
criterion, we can implement an optimal method for minimizing 
bit rates with adjustable approximation error. 

The experiment showed the effectiveness of the proposed 
schemes: prosodic information is preserved while F0 and 
gain undergo high compression. In a formal listening test, we 
confirmed that the quality of the proposed coder was very close 
to that of a conventional 2400-b/s coder. 
This coder is limited to a single speaker's voice. If we limit 
the application to where only one speaker's voice is needed, such 
as a personalized communication system, the proposed coder can 
be successfully exploited. Otherwise, additional effort to achieve 
multiple speaker capability is needed. Increasing the size of the 
database to contain a number of speakers' utterances is a possible 
solution. Although work on voice personality conversion is still 
underway, a future voice personality transformation algorithm 
will be a solution for this multiple speaker capability. 